Gender homophily in online book networks
We measure the gender homophily (and other network statistics) of large-scale online book markets, amazon.com and amazon.co.uk, using datasets describing millions of books sold to readers. Large book networks are created from sales (two books are connected if many readers have bought both) and can be used to recommend new books. The networks are analysed by the gender of their first author: is book consumption assortative by gender? Book networks are indeed gender-assortative: readers globally prefer to read from one author gender (the global assortativity coefficient by gender is around 0.4). Although 33% of first authors among all books are female, female-authored books are not proportionally sold together with male-authored books: an average of 20% (and a median of 11%) of the books co-bought with male-authored books are female-authored. Instead, female-authored books make up on average more than half of the books co-bought with other female-authored books. The gender makeup of literary genres and of structural book communities shows that the gender homophily originates in a gender skew not only in certain literary genres (a fact known from prior studies), but even more strongly in certain book communities, with these communities spanning multiple literary genres.
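The global gender assortativity reported above can be computed with off-the-shelf tools. A minimal sketch using networkx, on an invented toy co-purchase network (the node genders and edges are illustrative, not taken from the Amazon data):

```python
import networkx as nx

# Toy co-purchase network: nodes are books, an edge means the two books
# are frequently bought together; 'gender' is the first author's gender.
# All nodes and edges here are invented for illustration.
G = nx.Graph()
G.add_nodes_from([1, 2, 3], gender="F")
G.add_nodes_from([4, 5, 6], gender="M")
G.add_edges_from([(1, 2), (2, 3), (1, 3),   # F-F co-purchases
                  (4, 5), (5, 6), (4, 6),   # M-M co-purchases
                  (3, 4)])                  # one cross-gender co-purchase
r = nx.attribute_assortativity_coefficient(G, "gender")
print(round(r, 2))   # → 0.71
```

With only a single cross-gender edge, the toy network is strongly assortative; an even mix of within- and cross-gender edges would push the coefficient towards 0.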
The semantics of constellation line figures
We answer the question of whether, when forming constellations in the night sky,
people in astronomical cultures around the world consistently imagine and
assign the same symbolism to the same (type of) star cluster. Evidence of
semantic universality has so far been anecdotal. We use two complementary
definitions for a star cluster: (1) a star group in a particular sky region
(regardless of its exact shape), and (2) a star group with a particular shape
and brightness (regardless of its location in the sky). Over a dataset of 1903
constellations from 75 astronomical cultures, we find semantic parallels which
are likely culturally induced: body parts in the sky region delineated by the
International Astronomical Union (IAU) as Ori, fish in Cru and Sco, geometric
symbols in Cru, groups in UMa, mammals in UMa, and reptiles in Sco.
Surprisingly, we find many more significant semantic parallels which can only
be naturally induced by the shape and composition of the star pattern
underlying a constellation (or, are endogenous to the sky rather than
culture-dependent): arthropods in IAU Sco, body parts in Tau, geometric and
group symbols in star clusters (regardless of sky region) with a small number
of bright stars comparable in magnitude, humanoids and mammals naturalistically
drawn in star clusters with large spatial diameter and many stars, landscapes
in IAU Eri, man-made objects of various types in many IAU regions, and reptiles
consistently drawn in star clusters with low aspect ratio or low branching in
the minimum spanning tree drawn over the stars. These naturally induced
semantics show that there are universal (rather than only cultural) thought
patterns behind forming and naming constellations.
Comment: Part 2 of arXiv:2110.12329, published in PLOS ONE 17(7): e0272270 (2022). Shares the same dataset.
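The shape statistics named above (aspect ratio, branching in the minimum spanning tree drawn over the stars) can be sketched as follows; the star coordinates are invented for illustration:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Hypothetical (x, y) star positions for one constellation; the
# coordinates are invented for illustration.
stars = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0],
                  [3.0, 0.2], [4.0, 0.1]])

# Minimum spanning tree over pairwise distances between the stars.
mst = minimum_spanning_tree(squareform(pdist(stars))).toarray()
adj = (mst + mst.T) > 0                 # undirected MST adjacency

# Branching = maximum node degree in the MST (a pure chain gives 2);
# aspect ratio = short side over long side of the bounding box.
branching = int(adj.sum(axis=1).max())
extent = stars.max(axis=0) - stars.min(axis=0)
aspect_ratio = float(extent.min() / extent.max())
print(branching, aspect_ratio)
```

This nearly collinear chain yields low branching and a low aspect ratio, the kind of elongated star pattern the abstract associates with reptile symbolism.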
Top influencers can be identified universally by combining classical centralities
Information flow, opinion, and epidemics spread over structured networks.
When using individual node centrality indicators to predict which nodes will be
among the top influencers or spreaders in a large network, no single centrality
has consistently good ranking power. We show that statistical classifiers using
two or more centralities as input are instead consistently predictive over many
diverse, static real-world topologies. Certain pairs of centralities cooperate
particularly well in statistically drawing the boundary between the top
spreaders and the rest: local centralities measuring the size of a node's
neighbourhood benefit from the addition of a global centrality such as the
eigenvector centrality, closeness, or the core number. This is, intuitively,
because a local centrality may rank highly some nodes which are located in
dense, but peripheral regions of the network---a situation in which an
additional global centrality indicator can help by prioritising nodes located
more centrally. The nodes selected as superspreaders will usually jointly
maximise the values of both centralities. As a result of the interplay between
centrality indicators, training classifiers with seven classical indicators
leads to a nearly maximal average precision score (0.995) across the networks in this study.
Comment: 14 pages, 10 figures, 4 supplementary figures.
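A minimal sketch of the approach, pairing a local and a global centrality as classifier inputs. The network, the simulated spreading labels, and the top-10% cutoff are all illustrative stand-ins for the paper's setup:

```python
import random

import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

random.seed(0)
G = nx.barabasi_albert_graph(200, 3, seed=1)   # hypothetical topology

def outbreak_size(G, source, p=0.1, trials=30):
    """Mean final size of a simple cascade seeded at `source` (a cheap
    stand-in for the paper's epidemic simulations)."""
    total = 0
    for _ in range(trials):
        infected = {source}
        frontier = [source]
        while frontier:
            nxt = []
            for u in frontier:
                for v in G[u]:
                    if v not in infected and random.random() < p:
                        infected.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(infected)
    return total / trials

# Label the top 10% of nodes by simulated spreading power.
sizes = np.array([outbreak_size(G, v) for v in G])
y = (sizes >= np.quantile(sizes, 0.9)).astype(int)

# Two centralities as classifier inputs: one local, one global.
deg = nx.degree_centrality(G)
eig = nx.eigenvector_centrality(G, max_iter=1000)
X = np.array([[deg[v], eig[v]] for v in G])

clf = LogisticRegression().fit(X, y)
print(round(clf.score(X, y), 2))
```

The classifier learns a joint boundary in the two-centrality plane, which is exactly where a single centrality's ranking fails for nodes in dense but peripheral regions.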
Beyond ranking nodes: Predicting epidemic outbreak sizes by network centralities
Identifying important nodes for disease spreading is a central topic in
network epidemiology. We investigate how well the position of a node,
characterized by standard network measures, can predict its epidemiological
importance in any graph of a given number of nodes. This is in contrast to
other studies that deal with the easier prediction problem of ranking nodes by
their epidemic importance in given graphs. As a benchmark for epidemic
importance, we calculate the exact expected outbreak size given a node as the
source. We study exhaustively all graphs of a given size, so do not restrict
ourselves to certain generative models for graphs, nor to graph data sets. Due
to the large number of possible nonisomorphic graphs of a fixed size, we are
limited to 10-node graphs. We find that combinations of two or more
centralities are predictive (scores of 0.91 or higher) even for the most
difficult parameter values of the epidemic simulation. Typically, these
successful combinations include one normalized spectral centrality (such as
PageRank or Katz centrality) and one measure that is sensitive to the number of
edges in the graph.
Improved search methods for assessing Delay-Tolerant Networks vulnerability to colluding strong heterogeneous attacks
Increasingly, digital communication is routed among wireless, mobile computers over ad-hoc, unsecured communication channels. In this paper, we design two stochastic search algorithms (a greedy heuristic and an evolutionary algorithm) which automatically search for strong insider attack methods against a given ad-hoc, delay-tolerant communication protocol, and thus expose its weaknesses. To assess their performance, we apply the two algorithms to two simulated, large-scale mobile scenarios (of different route morphology) with 200 nodes having free range of movement. We investigate a choice of two standard attack strategies (dropping messages and flooding the network) and four delay-tolerant routing protocols: First Contact, Epidemic, Spray and Wait, and MaxProp. We find dramatic drops in performance: replicative protocols (Epidemic, Spray and Wait, MaxProp), formerly deemed resilient, are compromised to different degrees (delivery rates between 24% and 87%), while a forwarding protocol (First Contact) is shown to drop delivery rates to under 5%, in all cases via well-crafted attack strategies and an attacker group smaller than 10% of the total network size. Overall, we show that the two proposed methods combined constitute an effective means to discover (at design time) and raise awareness about the weaknesses and strengths of existing ad-hoc, delay-tolerant communication protocols against potential malicious cyber-attacks.
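The greedy-heuristic half of the search can be sketched generically. The damage objective below (counting node pairs disconnected when the attackers stop relaying) is a crude stand-in for a full delay-tolerant-network simulation of a message-dropping attack:

```python
import networkx as nx

def greedy_attackers(G, budget, damage):
    """Greedy heuristic sketch: repeatedly add the node whose inclusion
    in the attacker set maximises a damage objective. `damage` is a
    stand-in for a full DTN simulation (e.g. delivery-rate drop under
    message dropping); here it is any callable on a node set."""
    attackers = set()
    for _ in range(budget):
        best = max((v for v in G if v not in attackers),
                   key=lambda v: damage(attackers | {v}))
        attackers.add(best)
    return attackers

def disconnection_damage(G):
    """Toy objective: number of ordered node pairs disconnected once the
    attacker nodes stop relaying (are removed from the contact graph)."""
    def damage(attackers):
        H = G.copy()
        H.remove_nodes_from(attackers)
        n = H.number_of_nodes()
        connected = sum(len(c) * (len(c) - 1)
                        for c in nx.connected_components(H))
        return n * (n - 1) - connected
    return damage

G = nx.barbell_graph(5, 1)   # two cliques joined through one relay node
print(greedy_attackers(G, 1, disconnection_damage(G)))   # → {5}
```

Even with a budget of one, the greedy search finds the single relay node whose compromise partitions the network, the kind of high-impact insider the paper's search methods are designed to expose.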
Large-scale multi-objective influence maximisation with network downscaling
Finding the most influential nodes in a network is a computationally hard
problem with several possible applications in various kinds of network-based
problems. While several methods have been proposed for tackling the influence
maximisation (IM) problem, their runtime typically scales poorly when the
network size increases. Here, we propose an original method, based on network
downscaling, that allows a multi-objective evolutionary algorithm (MOEA) to
solve the IM problem on a reduced scale network, while preserving the relevant
properties of the original network. The downscaled solution is then upscaled to
the original network, using a mechanism based on centrality metrics such as
PageRank. Our results on eight large networks (including two with 50k
nodes) demonstrate the effectiveness of the proposed method, with a more than
10-fold runtime gain compared to the time needed on the original network, and a
further time reduction compared to CELF.
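The upscaling step can be sketched as a rank-matching mechanism on PageRank scores. This is an illustrative reading of the mechanism, not the paper's exact procedure, and the two networks below are synthetic:

```python
import networkx as nx

def upscale_seeds(small_G, big_G, small_seeds):
    """Sketch of the upscaling step: a seed found on the reduced network
    is mapped to an original-network node of matching PageRank rank,
    rescaled by the size ratio of the two networks."""
    small_rank = sorted(small_G, key=nx.pagerank(small_G).get, reverse=True)
    big_rank = sorted(big_G, key=nx.pagerank(big_G).get, reverse=True)
    scale = big_G.number_of_nodes() / small_G.number_of_nodes()
    return {big_rank[min(int(small_rank.index(s) * scale),
                         len(big_rank) - 1)]
            for s in small_seeds}

# Synthetic stand-ins for a downscaled network and the original one.
small = nx.barabasi_albert_graph(20, 2, seed=3)
big = nx.barabasi_albert_graph(200, 2, seed=3)
seeds_small = sorted(small, key=nx.pagerank(small).get, reverse=True)[:2]
print(upscale_seeds(small, big, seeds_small))
```

The expensive multi-objective search thus only ever runs on the small network; the original network is touched just once, for the cheap centrality-based mapping.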
Independent Prototype Propagation for Zero-Shot Compositionality
Humans are good at compositional zero-shot reasoning; someone who has never
seen a zebra before could nevertheless recognize one when we tell them it looks
like a horse with black and white stripes. Machine learning systems, on the
other hand, usually leverage spurious correlations in the training data, and
while such correlations can help recognize objects in context, they hurt
generalization. To be able to deal with underspecified datasets while still
leveraging contextual clues during classification, we propose ProtoProp, a
novel prototype propagation graph method. First we learn prototypical
representations of objects (e.g., zebra) that are conditionally independent
w.r.t. their attribute labels (e.g., stripes) and vice versa. Next we propagate
the independent prototypes through a compositional graph, to learn
compositional prototypes of novel attribute-object combinations that reflect
the dependencies of the target distribution. The method does not rely on any
external data, such as class hierarchy graphs or pretrained word embeddings. We
evaluate our approach on AO-CLEVr, a synthetic and strongly visual dataset
with clean labels, and UT-Zappos, a noisy real-world dataset of fine-grained
shoe types. We show that in the generalized compositional zero-shot setting we
outperform state-of-the-art results, and through ablations we show the
importance of each part of the method and their contribution to the final
results.
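The propagation idea can be illustrated with a deliberately tiny example: one-hot vectors stand in for the learned, conditionally independent prototypes, and a single averaging step stands in for the learned propagation over the compositional graph:

```python
import numpy as np

# One-hot stand-ins for learned prototypes; in ProtoProp these are
# conditionally independent learned representations.
prototypes = {
    "zebra":   np.array([1.0, 0.0, 0.0, 0.0]),
    "horse":   np.array([0.0, 1.0, 0.0, 0.0]),
    "striped": np.array([0.0, 0.0, 1.0, 0.0]),
}

def compose(attr, obj):
    # One-hop propagation sketch: the composition node for an unseen
    # (attribute, object) pair averages its two prototype neighbours.
    return (prototypes[attr] + prototypes[obj]) / 2

striped_horse = compose("striped", "horse")

# Classify a query embedding against seen and composed prototypes.
query = np.array([0.1, 0.5, 0.5, 0.0])   # an embedding of a striped horse
candidates = {"zebra": prototypes["zebra"], "striped horse": striped_horse}
pred = max(candidates, key=lambda k: float(candidates[k] @ query))
print(pred)   # → striped horse
```

The composed prototype lets the classifier recognise an attribute-object combination that never occurred in training, which is the compositional zero-shot setting the abstract describes.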
Automated fault tree learning from continuous-valued sensor data: a case study on domestic heaters
Many industrial sectors have been collecting big sensor data. With recent
technologies for processing big data, companies can exploit this for automatic
failure detection and prevention. We propose the first completely automated
method for failure analysis, machine-learning fault trees from raw
observational data with continuous variables. Our method scales well and is
tested on a real-world, five-year dataset of domestic heater operations in The
Netherlands, with 31 million unique heater-day readings, each containing 27
sensor and 11 failure variables. Our method builds on two previous procedures:
the C4.5 decision-tree learning algorithm, and the LIFT fault tree learning
algorithm from Boolean data. C4.5 pre-processes each continuous variable: it
learns an optimal numerical threshold which distinguishes between faulty and
normal operation of the top-level system. These thresholds discretise the
variables, thus allowing LIFT to learn fault trees which model the root failure
mechanisms of the system and are explainable. We obtain fault trees for the 11
failure variables, and evaluate them in two ways: quantitatively, with a
significance score, and qualitatively, with domain specialists. Some of the
fault trees learnt have almost maximum significance (above 0.95), while others
have medium-to-low significance (around 0.30), reflecting the difficulty of
learning from big, noisy, real-world sensor data. The domain specialists
confirm that the fault trees model meaningful relationships among the
variables.
Comment: Preprint submitted to the International Journal of Prognostics and Health Management, March 202
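The threshold-learning step can be sketched with a depth-1 decision tree (sklearn's CART standing in for C4.5); the sensor readings and failure labels below are synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for one continuous sensor variable: readings on normal
# days versus readings on (rarer) failure days.
rng = np.random.default_rng(0)
normal_temp = rng.normal(60, 5, 500)    # e.g. boiler temperature, normal
faulty_temp = rng.normal(85, 5, 50)     # temperature on failure days
X = np.concatenate([normal_temp, faulty_temp]).reshape(-1, 1)
y = np.concatenate([np.zeros(500), np.ones(50)])

# A depth-1 tree (stump) learns the single numerical threshold that
# best separates faulty from normal operation.
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
threshold = stump.tree_.threshold[0]
print(round(threshold, 1))
```

The resulting Boolean event ("temperature above the learned threshold") is the discretised variable that a Boolean fault tree learner such as LIFT can then consume.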